Buying and selling used smartphones used to happen on a handful of online marketplace sites. But the used and refurbished phone market has grown considerably over the past decade, and a recent IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7 billion by 2023, growing at a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used smartphones, which offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses looking to save money when purchasing a smartphone. There are plenty of other benefits associated with the used smartphone market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, provide attractive offers to customers on refurbished smartphones. Maximizing the longevity of mobile phones through second-hand trade also reduces their environmental impact by promoting recycling and reducing waste. The impact of the COVID-19 outbreak may further boost the cheaper refurbished smartphone segment, as consumers cut back on discretionary spending and buy phones only for immediate needs.
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones. I have been hired as a Data Scientist by ReCell, a startup aiming to tap the potential in this market. They want me to analyze the data provided and build a linear regression model to predict the price of a used phone and identify factors that significantly influence it.
The data contains different attributes of used/refurbished phones. The detailed data dictionary is given below.
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# to split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression model using statsmodels
import statsmodels.api as sm
data = pd.read_csv("used_phone_data.csv")
# copying data to another variable to avoid any changes to original data
df = data.copy()
data.head()
| brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | new_price | used_price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Honor | Android | 23.97 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 111.62 | 86.96 |
| 1 | Honor | Android | 28.10 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 249.39 | 161.49 |
| 2 | Honor | Android | 24.29 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 359.47 | 268.55 |
| 3 | Honor | Android | 26.04 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 278.93 | 180.23 |
| 4 | Honor | Android | 15.72 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 140.87 | 103.80 |
# check number of rows and columns
data.shape
(3571, 15)
# checking column datatypes and number of non-null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3571 entries, 0 to 3570
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   brand_name        3571 non-null   object
 1   os                3571 non-null   object
 2   screen_size       3571 non-null   float64
 3   4g                3571 non-null   object
 4   5g                3571 non-null   object
 5   main_camera_mp    3391 non-null   float64
 6   selfie_camera_mp  3569 non-null   float64
 7   int_memory        3561 non-null   float64
 8   ram               3561 non-null   float64
 9   battery           3565 non-null   float64
 10  weight            3564 non-null   float64
 11  release_year      3571 non-null   int64
 12  days_used         3571 non-null   int64
 13  new_price         3571 non-null   float64
 14  used_price        3571 non-null   float64
dtypes: float64(9), int64(2), object(4)
memory usage: 418.6+ KB
Observations
# check the number of unique values in each column of the dataframe
data.nunique()
brand_name            34
os                     4
screen_size          127
4g                     2
5g                     2
main_camera_mp        44
selfie_camera_mp      37
int_memory            16
ram                   14
battery              354
weight               613
release_year           8
days_used            930
new_price           3099
used_price          3044
dtype: int64
Observations
# Renaming the columns 4g and 5g, as column names beginning with digits cannot be used with some functions
df = df.rename(columns={"4g": "fourg", "5g": "fiveg"})
# converting categorical columns to the category dtype
# (this reduces the memory required to store the dataframe,
# as every class in a categorical column is represented by a number under the hood;
# this is also useful during model building)
df["brand_name"] = df.brand_name.astype("category")
df["os"] = df.os.astype("category")
df["fourg"] = df.fourg.astype("category")
df["fiveg"] = df.fiveg.astype("category")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3571 entries, 0 to 3570
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   brand_name        3571 non-null   category
 1   os                3571 non-null   category
 2   screen_size       3571 non-null   float64
 3   fourg             3571 non-null   category
 4   fiveg             3571 non-null   category
 5   main_camera_mp    3391 non-null   float64
 6   selfie_camera_mp  3569 non-null   float64
 7   int_memory        3561 non-null   float64
 8   ram               3561 non-null   float64
 9   battery           3565 non-null   float64
 10  weight            3564 non-null   float64
 11  release_year      3571 non-null   int64
 12  days_used         3571 non-null   int64
 13  new_price         3571 non-null   float64
 14  used_price        3571 non-null   float64
dtypes: category(4), float64(9), int64(2)
memory usage: 322.7 KB
# checking for missing values
df.isnull().sum()
brand_name            0
os                    0
screen_size           0
fourg                 0
fiveg                 0
main_camera_mp      180
selfie_camera_mp      2
int_memory           10
ram                  10
battery               6
weight                7
release_year          0
days_used             0
new_price             0
used_price            0
dtype: int64
As the numeric columns have only a small number of missing values, we will replace the missing values in each column with that column's median.
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# impute missing values in each numeric column with that column's median
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
# checking the number of missing values
df.isnull().sum()
brand_name          0
os                  0
screen_size         0
fourg               0
fiveg               0
main_camera_mp      0
selfie_camera_mp    0
int_memory          0
ram                 0
battery             0
weight              0
release_year        0
days_used           0
new_price           0
used_price          0
dtype: int64
# Let's look at the statistical summary of the data
pd.set_option(
"display.float_format", lambda x: "%.3f" % x
) # to display numbers rounded off to 3 decimal places
df.describe(include="all").T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| brand_name | 3571 | 34 | Others | 509 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| os | 3571 | 4 | Android | 3246 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| screen_size | 3571.000 | NaN | NaN | NaN | 14.804 | 5.153 | 2.700 | 12.700 | 13.490 | 16.510 | 46.360 |
| fourg | 3571 | 2 | yes | 2359 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| fiveg | 3571 | 2 | no | 3419 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| main_camera_mp | 3571.000 | NaN | NaN | NaN | 9.330 | 4.705 | 0.080 | 5.000 | 8.000 | 13.000 | 48.000 |
| selfie_camera_mp | 3571.000 | NaN | NaN | NaN | 6.546 | 6.878 | 0.300 | 2.000 | 5.000 | 8.000 | 32.000 |
| int_memory | 3571.000 | NaN | NaN | NaN | 54.470 | 84.586 | 0.005 | 16.000 | 32.000 | 64.000 | 1024.000 |
| ram | 3571.000 | NaN | NaN | NaN | 4.057 | 1.390 | 0.030 | 4.000 | 4.000 | 4.000 | 16.000 |
| battery | 3571.000 | NaN | NaN | NaN | 3067.113 | 1363.063 | 80.000 | 2100.000 | 3000.000 | 4000.000 | 12000.000 |
| weight | 3571.000 | NaN | NaN | NaN | 179.384 | 90.197 | 23.000 | 140.000 | 159.000 | 184.000 | 950.000 |
| release_year | 3571.000 | NaN | NaN | NaN | 2015.965 | 2.292 | 2013.000 | 2014.000 | 2016.000 | 2018.000 | 2020.000 |
| days_used | 3571.000 | NaN | NaN | NaN | 675.391 | 248.641 | 91.000 | 536.000 | 690.000 | 872.000 | 1094.000 |
| new_price | 3571.000 | NaN | NaN | NaN | 237.389 | 197.546 | 9.130 | 120.130 | 189.800 | 291.935 | 2560.200 |
| used_price | 3571.000 | NaN | NaN | NaN | 109.880 | 121.501 | 2.510 | 45.205 | 75.530 | 126.000 | 1916.540 |
Observations
df.duplicated().sum()
0
No duplicate values found
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a star will indicate the mean value of the column
    # histogram; let seaborn choose the bins automatically when bins is None
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # add median to the histogram
histogram_boxplot(df, "screen_size")
Observations
histogram_boxplot(df, "main_camera_mp")
Observations
histogram_boxplot(df, "selfie_camera_mp")
Observations
histogram_boxplot(df, "int_memory")
Observations
histogram_boxplot(df, "ram")
Observations
histogram_boxplot(df, "battery")
Observations
histogram_boxplot(df, "weight")
Observations
histogram_boxplot(df, "release_year")
Observations
histogram_boxplot(df, "days_used")
Observations
histogram_boxplot(df, "new_price")
Observations
histogram_boxplot(df, "used_price")
Observations
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate: center of the bar
        y = p.get_height()  # y-coordinate: top of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
labeled_barplot(df, "brand_name", perc=True)
Observations
labeled_barplot(df, "os", perc=True)
Observations
labeled_barplot(df, "fourg", perc=True)
Observations
labeled_barplot(df, "fiveg", perc=True)
Observations
Plot bivariate charts between numeric variables to understand their interaction with each other.
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# correlation heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
df[numeric_columns].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral",
)
plt.show()
Observations
sns.pairplot(data=df[numeric_columns], diag_kind="kde")
plt.show()
Observations
plt.figure(figsize=(30, 30))
sns.barplot(data=df, x="brand_name", y="ram")
<AxesSubplot:xlabel='brand_name', ylabel='ram'>
Observations
plt.figure(figsize=(30, 15))
sns.lineplot(x="weight", y="battery", data=df[df["battery"] > 4500], hue="brand_name")
plt.show()
Observations
plt.figure(figsize=(30, 15))
sns.countplot(data=df[df["screen_size"] > 6], x="brand_name")
plt.show()
Observations
plt.figure(figsize=(30, 15))
sns.countplot(data=df[df["selfie_camera_mp"] > 8], x="brand_name")
plt.show()
Observations
plt.figure(figsize=(10, 5))
sns.barplot(data=df, x="os", y="used_price")
plt.show()
Observations
plt.figure(figsize=(10, 5))
sns.barplot(data=df, x="fourg", y="used_price")
plt.show()
plt.figure(figsize=(10, 5))
sns.barplot(data=df, x="fiveg", y="used_price")
plt.show()
Observations
# Brand names and their average used_price
df.groupby("brand_name")["used_price"].mean()
plt.figure(figsize=(30, 5))
sns.barplot(data=df, y="used_price", x="brand_name")
plt.show()
# Creating a new column based on the mean_used_price of the brand_name.
df["mean_used_price"] = df.groupby("brand_name")["used_price"].transform("mean")
# Splitting the data into 3 equal bins based on the mean_used_price
df["Phone_Category"] = pd.qcut(df["mean_used_price"], 3, labels=(1, 2, 3))
# Renaming the Label Values
label_map = {1: "Budget", 2: "Mid_Range", 3: "High_End"}
df["Phone_Category"] = df["Phone_Category"].map(label_map)
# After Column Binning
# plt.figure(figsize=(5, 5))
# sns.barplot(data=df, y="mean_used_price", x="Phone_Category")
# plt.show()
labeled_barplot(df, "Phone_Category", perc=True)
Budget phones form 38.7% of the total data, whereas High End phones form only 24% of the given data.
# drop the brand_name and mean_used_price columns
df.drop(["brand_name", "mean_used_price"], axis=1, inplace=True)
# let's plot the boxplots of all columns to check for outliers
df1 = df.select_dtypes(exclude=["category"])
for column in df1:
plt.figure(figsize=(17, 1))
sns.boxplot(data=df1, x=column)
Observations
def treat_outliers(df, col):
    """
    treats outliers in a variable
    df: dataframe
    col: name of the numerical column
    """
Q1 = df[col].quantile(0.25) # 25th quantile
Q3 = df[col].quantile(0.75) # 75th quantile
IQR = Q3 - Q1
Lower_Whisker = Q1 - 1.5 * IQR
Upper_Whisker = Q3 + 1.5 * IQR
# all the values smaller than Lower_Whisker will be assigned the value of Lower_Whisker
# all the values greater than Upper_Whisker will be assigned the value of Upper_Whisker
df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
return df
def treat_outliers_all(df, col_list):
"""
treat outlier in all numerical variables
col_list: list of numerical variables
df: data frame
"""
for c in col_list:
df = treat_outliers(df, c)
return df
# treating the outliers
numerical_col = df.select_dtypes(include=np.number).columns.tolist()
df = treat_outliers_all(df, numerical_col)
# let's look at the boxplots to see if the outliers have been treated or not
df1 = df.select_dtypes(exclude=["category"])
for column in df1:
plt.figure(figsize=(17, 1))
sns.boxplot(data=df1, x=column)
Some features are very skewed and will likely behave better on the log scale. Let's transform new_price and used_price.
cols_to_log = ["new_price", "used_price"]
for colname in cols_to_log:
plt.hist(df[colname], bins=50)
plt.title(colname)
plt.show()
print(np.sum(df[colname] <= 0))
0
0
# transform new_price
plt.hist(np.log(df["new_price"]), 50)
plt.title("log(new_price)")
plt.show()
plt.hist(np.arcsinh(df["new_price"]), 50)
plt.title("arcsinh(new_price)")
plt.show()
plt.hist(np.sqrt(df["new_price"]), 50)
plt.title("sqrt(new_price)")
plt.show()
# transform used_price
plt.hist(np.log(df["used_price"]), 50)
plt.title("log(used_price)")
plt.show()
plt.hist(np.arcsinh(df["used_price"]), 50)
plt.title("arcsinh(used_price)")
plt.show()
plt.hist(np.sqrt(df["used_price"]), 50)
plt.title("sqrt(used_price)")
plt.show()
Observations
# replace columns with sqrt transformed columns
for colname in cols_to_log:
df[colname + "_sqrt"] = np.sqrt(df[colname])
df.drop(cols_to_log, axis=1, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3571 entries, 0 to 3570
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   os                3571 non-null   category
 1   screen_size       3571 non-null   float64
 2   fourg             3571 non-null   category
 3   fiveg             3571 non-null   category
 4   main_camera_mp    3571 non-null   float64
 5   selfie_camera_mp  3571 non-null   float64
 6   int_memory        3571 non-null   float64
 7   ram               3571 non-null   float64
 8   battery           3571 non-null   float64
 9   weight            3571 non-null   float64
 10  release_year      3571 non-null   int64
 11  days_used         3571 non-null   int64
 12  Phone_Category    3571 non-null   category
 13  new_price_sqrt    3571 non-null   float64
 14  used_price_sqrt   3571 non-null   float64
dtypes: category(4), float64(9), int64(2)
memory usage: 321.5 KB
Before we proceed to build a model, we'll have to encode categorical features.
We'll split the data into train and test to be able to evaluate the model that we build on the train data.
We will build a Linear Regression model using the train data and then check its performance.
# drop ram, since it became a constant value after outlier treatment
df.drop(["ram"], axis=1, inplace=True)
# defining X and y variables
X = df.drop(["used_price_sqrt"], axis=1)
y = df["used_price_sqrt"]
print(X.head())
print(y.head())
        os  screen_size fourg fiveg  main_camera_mp  selfie_camera_mp  \
0  Android       22.225   yes    no          13.000             5.000
1  Android       22.225   yes   yes          13.000            16.000
2  Android       22.225   yes   yes          13.000             8.000
3  Android       22.225   yes   yes          13.000             8.000
4  Android       15.720   yes    no          13.000             8.000

   int_memory  battery  weight  release_year  days_used Phone_Category  \
0      64.000 3020.000 146.000          2020        127       High_End
1     128.000 4300.000 213.000          2020        325       High_End
2     128.000 4200.000 213.000          2020        162       High_End
3      64.000 6850.000 250.000          2020        345       High_End
4      64.000 5000.000 185.000          2020        293       High_End

   new_price_sqrt
0          10.565
1          15.792
2          18.960
3          16.701
4          11.869
0    9.325
1   12.708
2   15.722
3   13.425
4   10.188
Name: used_price_sqrt, dtype: float64
X = pd.get_dummies(
X, columns=X.select_dtypes(include=["category"]).columns.tolist(), drop_first=True,
)
X.head()
| screen_size | main_camera_mp | selfie_camera_mp | int_memory | battery | weight | release_year | days_used | new_price_sqrt | os_Others | os_Windows | os_iOS | fourg_yes | fiveg_yes | Phone_Category_Mid_Range | Phone_Category_High_End | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.225 | 13.000 | 5.000 | 64.000 | 3020.000 | 146.000 | 2020 | 127 | 10.565 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 22.225 | 13.000 | 16.000 | 128.000 | 4300.000 | 213.000 | 2020 | 325 | 15.792 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 2 | 22.225 | 13.000 | 8.000 | 128.000 | 4200.000 | 213.000 | 2020 | 162 | 18.960 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 3 | 22.225 | 13.000 | 8.000 | 64.000 | 6850.000 | 250.000 | 2020 | 345 | 16.701 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 4 | 15.720 | 13.000 | 8.000 | 64.000 | 5000.000 | 185.000 | 2020 | 293 | 11.869 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 2499
Number of rows in test data = 1072
# fitting the model on the train data (70% of the whole data)
linearregression = LinearRegression()
linearregression.fit(x_train, y_train)
LinearRegression()
Let's check the coefficients and intercept of the model.
coef_df = pd.DataFrame(
np.append(linearregression.coef_, linearregression.intercept_),
index=x_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| screen_size | 0.009 |
| main_camera_mp | 0.001 |
| selfie_camera_mp | 0.025 |
| int_memory | 0.002 |
| battery | 0.000 |
| weight | -0.000 |
| release_year | -0.008 |
| days_used | -0.005 |
| new_price_sqrt | 0.600 |
| os_Others | -0.159 |
| os_Windows | -0.037 |
| os_iOS | 0.240 |
| fourg_yes | -0.087 |
| fiveg_yes | -0.408 |
| Phone_Category_Mid_Range | 0.033 |
| Phone_Category_High_End | 0.058 |
| Intercept | 19.325 |
Let's check the performance of the model using different metrics.
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
r2 = r2_score(targets, predictions)
n = predictors.shape[0]
k = predictors.shape[1]
return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
r2 = r2_score(target, pred) # to compute R-squared
adjr2 = adj_r2_score(predictors, target, pred) # to compute adjusted R-squared
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame(
{
"RMSE": rmse,
"MAE": mae,
"R-squared": r2,
"Adj. R-squared": adjr2,
"MAPE": mape,
},
index=[0],
)
return df_perf
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
linearregression_train_perf = model_performance_regression(
linearregression, x_train, y_train
)
linearregression_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.542 | 0.413 | 0.973 | 0.973 | 4.921 |
Observations
The training R² is 97.3%, indicating that the model explains 97.3% of the variation in the train data. So, the model is not underfitting.
We will compare these metrics with the test set performance later to check whether the model is overfitting.
MAE indicates that our current model is able to predict used_price (on the square-root scale) within a mean error of 0.413 on the training data.
MAPE on the training set suggests we can predict within 4.92% of the used_price.
# unlike sklearn, statsmodels does not add a constant to the data on its own
# we have to add the constant manually
x_train1 = sm.add_constant(x_train)
# adding constant to the test data
x_test1 = sm.add_constant(x_test)
olsmod0 = sm.OLS(y_train, x_train1).fit()
print(olsmod0.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_sqrt R-squared: 0.973
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 5533.
Date: Fri, 01 Oct 2021 Prob (F-statistic): 0.00
Time: 14:35:02 Log-Likelihood: -2015.0
No. Observations: 2499 AIC: 4064.
Df Residuals: 2482 BIC: 4163.
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const 19.3250 20.839 0.927 0.354 -21.538 60.188
screen_size 0.0094 0.005 1.957 0.050 -1.65e-05 0.019
main_camera_mp 0.0009 0.003 0.259 0.795 -0.006 0.008
selfie_camera_mp 0.0252 0.004 6.417 0.000 0.018 0.033
int_memory 0.0018 0.000 4.619 0.000 0.001 0.003
battery 7.005e-06 1.64e-05 0.427 0.669 -2.51e-05 3.92e-05
weight -0.0002 0.000 -0.329 0.742 -0.001 0.001
release_year -0.0079 0.010 -0.763 0.445 -0.028 0.012
days_used -0.0045 6.94e-05 -65.217 0.000 -0.005 -0.004
new_price_sqrt 0.6004 0.004 155.914 0.000 0.593 0.608
os_Others -0.1594 0.053 -3.026 0.003 -0.263 -0.056
os_Windows -0.0374 0.081 -0.464 0.643 -0.196 0.121
os_iOS 0.2398 0.098 2.439 0.015 0.047 0.433
fourg_yes -0.0868 0.034 -2.531 0.011 -0.154 -0.020
fiveg_yes -0.4078 0.065 -6.265 0.000 -0.536 -0.280
Phone_Category_Mid_Range 0.0331 0.028 1.195 0.232 -0.021 0.087
Phone_Category_High_End 0.0577 0.035 1.670 0.095 -0.010 0.126
==============================================================================
Omnibus: 96.524 Durbin-Watson: 1.991
Prob(Omnibus): 0.000 Jarque-Bera (JB): 187.169
Skew: 0.278 Prob(JB): 2.27e-41
Kurtosis: 4.220 Cond. No. 7.34e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
Negative coefficients indicate that used_price decreases as the corresponding attribute value increases.
Positive coefficients indicate that used_price increases as the corresponding attribute value increases.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
But these variables might contain multicollinearity, which will affect the p-values.
So, we need to deal with multicollinearity and check the other assumptions of linear regression first, and then look at the p-values.
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.
There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.
Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.
General rule of thumb: a VIF between 1 and 5 indicates low to moderate multicollinearity, while a VIF greater than 5 indicates high multicollinearity that may need to be treated.
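As a sanity check, the VIF of a predictor can be computed by hand as 1 / (1 - R²_k), where R²_k comes from regressing that predictor on all the others. A minimal sketch on toy data (the columns x1, x2, x3 are illustrative, not from this dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=200)  # strongly correlated with x1
x3 = rng.normal(size=200)                        # independent of the others
X_toy = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def vif_manual(X, col):
    """VIF of `col` = 1 / (1 - R^2) of regressing `col` on the remaining columns."""
    others = X.drop(columns=[col])
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    return 1 / (1 - r2)

for col in X_toy.columns:
    print(col, round(vif_manual(X_toy, col), 2))
```

The correlated pair x1/x2 should show inflated VIFs, while x3 should stay close to 1, matching the rule of thumb above.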
from statsmodels.stats.outliers_influence import variance_inflation_factor
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train1)
| feature | VIF | |
|---|---|---|
| 0 | const | 3670073.117 |
| 1 | screen_size | 3.352 |
| 2 | main_camera_mp | 2.025 |
| 3 | selfie_camera_mp | 3.655 |
| 4 | int_memory | 1.879 |
| 5 | battery | 3.587 |
| 6 | weight | 3.093 |
| 7 | release_year | 4.767 |
| 8 | days_used | 2.561 |
| 9 | new_price_sqrt | 2.512 |
| 10 | os_Others | 1.339 |
| 11 | os_Windows | 1.036 |
| 12 | os_iOS | 1.192 |
| 13 | fourg_yes | 2.250 |
| 14 | fiveg_yes | 1.416 |
| 15 | Phone_Category_Mid_Range | 1.518 |
| 16 | Phone_Category_High_End | 1.853 |
Observations
Since the VIF scores of all the variables (ignoring the constant) are less than 5, multicollinearity is low and we do not need to remove any variables.
Why the test?
How to check linearity and independence?
How to fix if this assumption is not followed?
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train # actual values
df_pred["Fitted Values"] = olsmod0.fittedvalues # predicted values
df_pred["Residuals"] = olsmod0.resid # residuals
df_pred.head()
| Actual Values | Fitted Values | Residuals | |
|---|---|---|---|
| 844 | 10.024 | 9.871 | 0.153 |
| 1539 | 10.568 | 10.726 | -0.158 |
| 3452 | 10.672 | 10.332 | 0.340 |
| 1727 | 8.006 | 8.382 | -0.377 |
| 1926 | 8.243 | 8.077 | 0.166 |
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
Observations
The scatter plot shows the distribution of residuals (errors) vs fitted (predicted) values.
If there is any pattern in this plot, it is a sign of non-linearity in the data, i.e., the model is failing to capture non-linear effects.
We see no pattern in the plot above. Hence, the assumptions of linearity and independence are satisfied.
Why the test?
How to check normality?
How to fix if this assumption is not followed?
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
import pylab
import scipy.stats as stats
stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
stats.shapiro(df_pred["Residuals"])
ShapiroResult(statistic=0.9872101545333862, pvalue=3.406625906387552e-14)
Since the p-value of the Shapiro-Wilk test is less than 0.05, the residuals are not perfectly normal. However, the histogram and Q-Q plot show that they are close to normal, which is generally acceptable for a sample of this size.
Homoscedasticity: If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.
Heteroscedasticity: If the variance is unequal for the residuals across the regression line, then the data is said to be heteroscedastic.
Why the test?
How to check for homoscedasticity?
How to fix if this assumption is not followed?
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], x_train1)
lzip(name, test)
[('F statistic', 1.1494064145017795), ('p-value', 0.007286988184836292)]
Observations
Since the p-value < 0.05, the residuals are heteroscedastic. Still, we will continue with the model, as it works well for Budget and Mid Range phones; the heteroscedasticity could be due to outliers in High End phone prices. Also, we are provided with a limited number of observations, which restricts our ability to investigate further.
Now that we have checked the assumptions of linear regression, and decided to proceed despite the minor violations, we can move on to the prediction part.
# predictions on the test set
pred = olsmod0.predict(x_test1)
df_pred_test = pd.DataFrame({"Actual": y_test, "Predicted": pred})
df_pred_test.sample(10, random_state=1)
| Actual | Predicted | |
|---|---|---|
| 2098 | 5.524 | 5.357 |
| 278 | 13.988 | 13.758 |
| 26 | 15.722 | 15.479 |
| 2910 | 9.485 | 9.312 |
| 2631 | 8.319 | 7.961 |
| 1582 | 9.465 | 10.349 |
| 2110 | 15.722 | 16.920 |
| 3160 | 8.083 | 7.870 |
| 2817 | 10.760 | 10.314 |
| 549 | 6.268 | 6.668 |
We can observe here that our model has returned pretty good prediction results, and the actual and predicted values are comparable.
We can also visualize comparison result as a bar graph.
Note: As the number of records is large, for representation purpose, we are taking a sample of 25 records only.
df1 = df_pred_test.sample(25, random_state=1)
df1.plot(kind="bar", figsize=(15, 7))
plt.show()
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmod0_train_perf = model_performance_regression(olsmod0, x_train1, y_train)
olsmod0_train_perf
Training Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.542 | 0.413 | 0.973 | 0.973 | 4.921 |
# checking model performance on test set (seen 30% data)
print("Test Performance\n")
olsmod0_test_perf = model_performance_regression(olsmod0, x_test1, y_test)
olsmod0_test_perf
Test Performance
| RMSE | MAE | R-squared | Adj. R-squared | MAPE | |
|---|---|---|---|---|---|
| 0 | 0.527 | 0.407 | 0.974 | 0.974 | 4.651 |
The model is able to explain ~97% of the variation in the data, which is very good.
The train and test RMSE and MAE are low and comparable. So, our model is not suffering from overfitting.
The MAPE on the test set suggests we can predict within ~4.65% of the target (used_price_sqrt) on average.
Hence, we can conclude the model olsmod0 is good for prediction as well as inference purposes.
Let's compare the initial model created with sklearn and the final statsmodels model.
# training performance comparison
models_train_comp_df = pd.concat(
[linearregression_train_perf.T, olsmod0_train_perf.T], axis=1,
)
models_train_comp_df.columns = [
"Linear Regression sklearn",
"Linear Regression statsmodels",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Linear Regression sklearn | Linear Regression statsmodels |
|---|---|---|
| RMSE | 0.542 | 0.542 |
| MAE | 0.413 | 0.413 |
| R-squared | 0.973 | 0.973 |
| Adj. R-squared | 0.973 | 0.973 |
| MAPE | 4.921 | 4.921 |
Let's recreate the final statsmodels model and print its summary to gain insights.
olsmodel_final = sm.OLS(y_train, x_train1).fit()
print(olsmodel_final.summary())
OLS Regression Results
==============================================================================
Dep. Variable: used_price_sqrt R-squared: 0.973
Model: OLS Adj. R-squared: 0.973
Method: Least Squares F-statistic: 5533.
Date: Fri, 01 Oct 2021 Prob (F-statistic): 0.00
Time: 14:35:09 Log-Likelihood: -2015.0
No. Observations: 2499 AIC: 4064.
Df Residuals: 2482 BIC: 4163.
Df Model: 16
Covariance Type: nonrobust
============================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------
const 19.3250 20.839 0.927 0.354 -21.538 60.188
screen_size 0.0094 0.005 1.957 0.050 -1.65e-05 0.019
main_camera_mp 0.0009 0.003 0.259 0.795 -0.006 0.008
selfie_camera_mp 0.0252 0.004 6.417 0.000 0.018 0.033
int_memory 0.0018 0.000 4.619 0.000 0.001 0.003
battery 7.005e-06 1.64e-05 0.427 0.669 -2.51e-05 3.92e-05
weight -0.0002 0.000 -0.329 0.742 -0.001 0.001
release_year -0.0079 0.010 -0.763 0.445 -0.028 0.012
days_used -0.0045 6.94e-05 -65.217 0.000 -0.005 -0.004
new_price_sqrt 0.6004 0.004 155.914 0.000 0.593 0.608
os_Others -0.1594 0.053 -3.026 0.003 -0.263 -0.056
os_Windows -0.0374 0.081 -0.464 0.643 -0.196 0.121
os_iOS 0.2398 0.098 2.439 0.015 0.047 0.433
fourg_yes -0.0868 0.034 -2.531 0.011 -0.154 -0.020
fiveg_yes -0.4078 0.065 -6.265 0.000 -0.536 -0.280
Phone_Category_Mid_Range 0.0331 0.028 1.195 0.232 -0.021 0.087
Phone_Category_High_End 0.0577 0.035 1.670 0.095 -0.010 0.126
==============================================================================
Omnibus: 96.524 Durbin-Watson: 1.991
Prob(Omnibus): 0.000 Jarque-Bera (JB): 187.169
Skew: 0.278 Prob(JB): 2.27e-41
Kurtosis: 4.220 Cond. No. 7.34e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Keeping other variables constant, phones with 4G have a used_price_sqrt lower by 0.0868 units than phones without 4G, and phones with 5G have a used_price_sqrt lower by 0.41 units than phones without 5G. Note that these coefficients are on the square-root scale of price, not in euros directly.
Mid Range and High End phones have a used_price_sqrt higher by 0.033 and 0.058 units, respectively, compared to Budget phones, though neither coefficient is statistically significant at the 5% level.
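Because the model's target is used_price_sqrt, predictions are on the square-root scale; squaring them recovers prices on the original euro scale. A minimal sketch, using a few of the sample predictions shown earlier:

```python
import numpy as np

# Predicted used_price_sqrt values from the sample predictions table above
pred_sqrt = np.array([5.357, 13.758, 15.479])

# Square to back-transform to the original price scale (euros)
pred_price_eur = pred_sqrt ** 2
print(pred_price_eur.round(2))
```

The same back-transformation would be applied before reporting predicted prices to business stakeholders.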
# let us try pandas-profiling now and see how it simplifies the EDA
# !pip install pandas-profiling==2.8.0
%pip install pandas-profiling
# import pandas_profiling
from pandas_profiling import ProfileReport
# Use the original dataframe, so that original features are considered
prof = ProfileReport(df)
# to view report created by pandas profile
prof